Title: Analysis of Crime and Economic Status in Toronto Neighbourhoods (2016-2021)¶
1. Motivation¶
Toronto is recognized as one of Canada’s most diverse and economically important cities, with rapidly growing financial, technological, and cultural sectors. Income disparities across Toronto’s neighbourhoods could be one reason for their differing crime rates. The cycles of poverty and crime that affect the city may stem from structural inequalities such as a lack of affordable housing, unemployment, and limited educational opportunities. Concern about these structural inequalities motivates our group to investigate the relationship between criminal activity and economic status in Toronto’s neighbourhoods, along with other possible underlying factors. Understanding how economic hardship relates to crime will show how economic conditions in an area are associated with higher or lower rates of criminal activity, and will help identify areas that might need policy intervention.
2. Review of Similar Research¶
The relationship between crime rates and poverty is intricate, as areas with lower socio-economic status are more likely to have criminal activity. According to Statistics Canada, the violent crime severity index was 41 percent higher in 2023 compared to 2014, with a reported surge in Canada’s largest urban centres (Lau, 2024). Toronto is reported to be facing a poverty epidemic, with residents facing both social and economic crises. The 2021 census showed that 13.2% of all Torontonians live in poverty, with higher levels among immigrants and racialized individuals, especially those from Indigenous communities (Social Planning Toronto, n.d.). While poverty seems to be more concentrated in the downtown region, it is unevenly distributed across the city. From these data sources, it is evident that Toronto is experiencing both rising crime rates and persistent economic inequality.
Research in criminology presented by the Government of Canada (2022) communicates the strong association between social and economic disadvantage and crime. Studies in North America and Europe have used data mapping and analysis techniques to investigate the relationship between criminal activity and economic status. Data collected has shown that the most serious offences include assault, robbery, and homicide, and that offenders are often unemployed or employed in low-paying, unskilled jobs. Further, the report highlights Ley and Smith’s (2000) research, which conveys the association between crime and social deprivation. The higher crime areas in central Toronto closely align with the most economically deprived neighbourhoods. Overall, areas with higher criminal activity were found to have lower socio-economic status, less stable residential communities, higher population densities, and specific land usage patterns that may create more opportunities for crime.
Wirdzek (2024) explored how crime rates in urban settings can be impacted by income inequalities. Similar to the previous sources we explored, the research indicated that increased crime rates correlated not only with lower income status, but also with a variety of socio-economic factors, such as limited access to quality education and social disorganization in neighbourhoods. The study employed a desk methodology, in which secondary data from existing research was analyzed. There was no explicit mention of statistical modelling within the author’s research framework. Some of the studies it cites employed statistical modelling techniques to analyze correlation, and others combined quantitative data analysis with qualitative data to establish a connection. However, implementing statistical modelling in the primary analysis could enhance the robustness of the conclusions and provide quantitative evidence to back up the claims. By integrating statistical methods and models, our group hopes to improve upon this report and take a more data-driven approach to strengthen the validity of our claims.
Finally, using data from 2012 to 2021, Mohammadi et al. (2022) conducted a study to identify patterns between homicide rates and a variety of socio-economic and environmental factors across Toronto neighborhoods. Their research process employed various spatial analysis techniques, including Geographically Weighted Regression (GWR) and Empirical Bayes Smoothing, to investigate the crime patterns. They found significant clustering of high homicide rates in the downtown and northwest parts of the city, which could be attributed to factors such as population density, material deprivation, and the density of commercial establishments. While our group does not plan to use the exact same techniques, we also plan to analyze different datasets and apply statistical modelling to explore crime distribution patterns.
3. Research Question¶
Taking our motivations and the existing research into consideration, our group formulated the following research question:
What is the relationship between household income levels and crime rates across Toronto neighborhoods, and how do these trends vary by type of crime?
4. Data¶
4.1 Data Sources and Features¶
This study uses data sourced from Toronto’s Open Data Portal, the central place for city divisions and agencies to share data publicly.
Crime Data:
- Source: Neighbourhood Crime Rates dataset, published by the Toronto Police Service.
- Features: Recorded totals for assault, auto theft, break and enter, robbery, theft over $5000, homicide and shooting/firearm discharges. Also included are crime rates per 100,000 population as calculated by Environics Analytics.
- File: raw_data/neighbourhood-crime-rates - 4326.csv
Socioeconomic Data:
- Source: Neighbourhood Profiles dataset, published by Social Development, Finance & Administration. This data comes from the Census of Population (which happens every five years) and covers demographic, social, and economic data for residents and households in each neighbourhood of Toronto.
- Files: raw_data/neighbourhood-profiles-2016-140-model.csv and raw_data/neighbourhood-profiles-2021-158-model.csv
- Key Features Used: Population counts, age ratios (youth, working-age, senior), average/median age, income measures (median, average, before-tax, after-tax for 2020), low-income percentage.
We will primarily focus on the relationship between income indicators (like Median_Income_2020, Average_Income_2020, Median_After_Tax_Income_2020, Average_After_Tax_Income_2020) and crime counts/rates for Assault, Auto Theft, Bike Theft, Break and Enter, Robbery, Theft from Motor Vehicle, and Theft Over $5000 for the years 2016 and 2021 where available and comparable.
4.2 Data Quality Assessment and Preparation¶
Several steps in data preparation were carried out to ensure quality and facilitate meaningful comparisons between the 2016 and 2021 datasets. Cleaning and preparation logic is encapsulated in Python scripts (clean_crime_data.py, clean_neighborhood_profile_data.py).
Neighborhood Reconciliation: Reconciliation was necessary because of boundary changes and naming inconsistencies between the 2016 (140 neighborhoods) and 2021 (158 neighborhoods) profile datasets. We standardized names (removing abbreviations and handling variations like 'St.' vs 'Street') and reconciled neighborhoods across the two years to obtain common geographic units for comparison. The analysis presented here focuses on neighborhoods that are available and consistently identifiable in both datasets. Initial analysis suggested about 122 common neighborhoods; the final number of usable matched neighborhoods with complete data, after handling missing values, was slightly higher.
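The name standardization described above could look roughly like the following. This is a minimal sketch, not the logic of clean_neighborhood_profile_data.py (which is not shown in this report): the abbreviation map, the `standardize_name` helper, and the example names are all illustrative assumptions.

```python
import re

# Hypothetical abbreviation map, used only to make names match across years.
# The real cleaning rules are not shown in this report and may differ.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def standardize_name(name):
    """Lowercase a name, replace punctuation with spaces, expand abbreviations."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


# Illustrative inputs, not the full neighbourhood lists.
names_2016 = ["Bay St. Corridor", "Mount Pleasant West"]
names_2021 = ["Bay Street Corridor", "Mount Pleasant East"]

# Neighbourhoods that can be identified in both years after standardization.
common = sorted(
    {standardize_name(n) for n in names_2016}
    & {standardize_name(n) for n in names_2021}
)
print(common)
```

A blanket rule like this also rewrites "St." where it means "Saint", which is harmless for matching as long as both years are processed the same way, but the standardized names should not be used for display.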
Data Cleaning:
- Missing Values: Where appropriate, missing crime counts were assumed to be zero (interpreting no report as zero incidents for that period). The handling of missing data for demographic variables was done with care; rows were excluded where critical information was missing and reliable imputation seemed unlikely.
- Column Selection: Only the columns relevant to the analysis (identifiers, demographic features, income metrics, crime counts/rates) were kept.
Variable Transformation: Derived variables were created as needed, particularly for comparing changes over time (though the primary analysis focuses on separate models for 2016 and 2021). Key variables used directly from the cleaned datasets are population totals, age ratios, income metrics, and crime counts/rates per type.
Merged versions of the cleaned crime and profile data for each respective year (2016 and 2021) were the final datasets used for modeling.
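The cleaning and merging steps can be sketched with pandas as follows. The toy rows, the `Neighbourhood` key column, and the `ROBBERY_2021` column name are assumptions made for illustration; the actual scripts may handle these steps differently.

```python
import pandas as pd

# Toy stand-ins for the cleaned crime and profile tables (values are made up).
crime = pd.DataFrame({
    "Neighbourhood": ["a", "b", "c"],
    "ASSAULT_2021": [120, None, 45],
    "ROBBERY_2021": [10, 3, None],
})
profile = pd.DataFrame({
    "Neighbourhood": ["a", "b", "d"],
    "Total_Population": [20000, 15000, 30000],
    "Median_Income_2020": [41000, None, 52000],
})

# Missing crime counts are interpreted as zero reported incidents.
crime_cols = ["ASSAULT_2021", "ROBBERY_2021"]
crime[crime_cols] = crime[crime_cols].fillna(0).astype(int)

# Rows missing a critical demographic value are dropped rather than imputed.
profile = profile.dropna(subset=["Median_Income_2020"])

# Keep only neighbourhoods present in both tables.
merged = crime.merge(profile, on="Neighbourhood", how="inner")
print(merged)
```

An inner join keeps only neighbourhoods found in both tables, which mirrors the report's focus on consistently identifiable neighbourhoods.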
4.3 Descriptive Statistics Overview¶
The final dataset used for analysis covers Toronto's diverse neighborhoods, showing substantial variation in both demographic characteristics and crime rates between 2016 and 2021.
(Note: The statistics below are cited from the provided text report and reflect the state of the data after cleaning and merging for the analytical sample, likely the 122 neighbourhoods with complete data.)
- Population: Neighbourhood populations vary significantly, ranging from approximately 6,000 to 85,000 residents. The mean population per neighbourhood increased slightly from ~22,135 in 2016 to ~22,940 in 2021.
- Low Income: The average percentage of residents classified as low income across analyzed neighbourhoods decreased from 19.5% in 2016 to 12.3% in 2021. (Note: This reflects the specific definition of 'low income' used in the census data).
- Crime: Total crime counts per neighbourhood exhibit high variability. The mean number of total reported incidents (across the analyzed crime types) per neighbourhood was approximately 298 in 2016 and increased slightly to 312 in 2021. Specific crime types showed different trends (e.g., Auto Theft counts increased significantly on average, while Robbery counts decreased).
- Improvement Areas: 25 neighbourhoods (approx. 14.2% of the original set, percentage might differ slightly for the final 122 sample) were designated as Neighbourhood Improvement Areas (NIAs), a status consistent across both periods in the source data.
These descriptive statistics highlight the dynamic nature of Toronto's neighbourhoods and provide context for the subsequent modeling efforts.
from IPython.display import Image, HTML, display
# Display map of Neighbourhood Improvement Areas
display(HTML("<h3>Neighbourhood Improvement Areas Map</h3>"))
display(Image(filename='output/toronto_neighborhoods_improvement_areas.png', width=800))
# Link to interactive version (optional, requires user interaction)
# display(HTML('<a href="output/toronto_neighborhoods_improvement_areas_interactive.html" target="_blank">View Interactive Map</a>'))
# Display comparison plots from box_plot directory
display(HTML("<h3>Descriptive Comparisons (2016 vs 2021)</h3>"))
display(Image(filename='output/box_plot/crime_counts_comparison.png', width=800))
display(Image(filename='output/box_plot/crime_percent_change.png', width=800))
display(Image(filename='output/box_plot/population_income_comparison.png', width=800))
Neighbourhood Improvement Areas Map
Descriptive Comparisons (2016 vs 2021)
5. Methodology¶
This section outlines the analytical strategy used to examine the relationship between household income levels and crime rates across Toronto neighbourhoods, in answer to the research question. The methodology comprises data processing, exploratory data analysis (EDA), and statistical modeling using Python.
5.1 Overall Approach¶
The current study adopts a quantitative methodology, largely involving the use of statistical analysis to explore correlational and predictive relationships between socioeconomic variables (primarily income) and measures of crime (counts and rates). Data from two different time points (2016 and 2021) are examined for evidence of change in these relationships.
5.2 Data Processing and Analysis Workflow¶
The workflow involved several stages, using several Python scripts we created:
- Data Cleaning: As described in Section 4.2, raw data from Toronto Open Data was cleaned and standardized using the scripts clean_crime_data.py and clean_neighborhood_profile_data.py, yielding consistent datasets for 2016 and 2021.
- Data Merging: The cleaned crime data and neighborhood profile datasets were merged for each year based on standardized neighborhood names to create comprehensive analytical datasets (merged_2016.csv, merged_2021.csv).
- Exploratory Data Analysis (EDA): Visual and statistical checks were carried out to examine how the data are distributed, identify outliers, detect potential problems, and look for possible relationships. This was performed using scripts like box_plot_2016.py, box_plot_2021.py, correlational_matrix.py, and create_visualizations.py.
- Statistical Modeling: Based on EDA results, linear regression was chosen as the main modeling method. This was implemented in lin_reg.py using the scikit-learn library.
- Results Interpretation: The output of the model (R² scores, coefficients) and its visuals (scatter plots, actual vs predicted plots) were used to analyze the research question.
5.3 Justification of Analytical Methods¶
Exploratory Data Analysis (EDA): EDA was key in generating hypotheses and choosing models.
- Box Plots: Helped visually compare the distribution of crime across income quartiles and highlight possible trends or shifts that may have occurred between 2016 and 2021.
- Correlation Matrices: Offered a quantitative summary of the linear relationships that exist between various socioeconomic indicators (population, age structure, income) and types of crime counts and rates. It gave an overview of which variables showed the strongest associations.
- Other Visualizations: Scatter plots, and possibly maps (such as the heatmap referred to later), aided in understanding spatial patterns and direct relationships between key variables (e.g., low income percentage vs. crime count).
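The two main EDA summaries listed above (a correlation matrix and a comparison across income quartiles) can be sketched on synthetic data as follows. The column name `Low_Income_Pct` and the simulated values are assumptions; the project's own scripts (correlational_matrix.py, box_plot_2021.py) are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic neighbourhood-level data; values are simulated for illustration only.
rng = np.random.default_rng(42)
n = 120
df = pd.DataFrame({
    "Total_Population": rng.integers(6000, 85000, size=n),
    "Low_Income_Pct": rng.uniform(5, 35, size=n),  # hypothetical column name
})
expected = 0.004 * df["Total_Population"].to_numpy() + 3 * df["Low_Income_Pct"].to_numpy()
df["ASSAULT_2021"] = rng.poisson(expected)

# Pearson correlation matrix between socioeconomic indicators and crime counts.
corr = df.corr(method="pearson")

# Split neighbourhoods into quartiles of low-income percentage and summarize
# the crime distribution within each; a box plot shows the same information.
df["Income_Quartile"] = pd.qcut(df["Low_Income_Pct"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
quartile_summary = df.groupby("Income_Quartile", observed=True)["ASSAULT_2021"].describe()

print(corr.round(2))
print(quartile_summary.round(1))
```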
Linear Regression: Multivariate linear regression was carried out even though EDA suggested that the relationship between income and normalized crime rates may be weak or may involve more complex interactions. It was chosen for the following reasons:
- Framework for Hypothesis Testing: It allows us to test whether income variables remain statistically significant predictors of variation in crime, even where the correlations are weak, while controlling for other factors (such as population and age structure).
- Interpretability: The coefficients of a linear regression model provide relatively simple interpretations about the effect of each predictor on the outcome variable as long as model assumptions are somewhat respected.
- Baseline Model: This is quite a well-understood baseline model. Its findings may be used to motivate research into more elaborate models for further complexity.
- Handling Multivariate Complexity: EDA suggested that no single variable predominated across all types of crime. Regression makes it possible to assess multivariate effects and partial effects for several predictors at one time.
In short, linear regression was used not to build the best possible predictive model, but to systematically examine the role of income and demographic factors in explaining variation in crime within a clear, interpretable framework. The models were run separately for crime counts and crime rates because EDA indicated different patterns.
In addition, to justify our choice of analytical methods, we assessed the structural properties of our dataset (distribution, dispersion, and model fit criteria) using both visualizations and statistical diagnostics. A basic consideration when handling count data such as crime incident reports is dispersion, the ratio of the variance to the mean. Poisson regression conventionally assumes that the variance equals the mean. As the dispersion heatmap demonstrates, however, our dataset exhibits extreme overdispersion for all crime types and predictor combinations, with variance-to-mean ratios ranging from 14 to more than 82. For example, for ASSAULT in 2021 the dispersion reaches 82.19, meaning that the variance is more than 80 times the mean, which makes Poisson regression inappropriate. Likewise, linear regression, which assumes homoscedasticity and normally distributed errors, becomes much less appropriate because our data are strongly skewed. Quasi-Poisson models can sometimes deal with modest overdispersion (dispersion ratios 1.0-1.5), but our values far exceed that threshold.
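As a small illustration of the dispersion check, the sketch below computes a raw variance-to-mean ratio. It assumes the ratio is taken over observed counts for one crime type; the report's heatmap may instead use a model-based (Pearson) dispersion, and the counts here are made up.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)


# Made-up assault counts for a handful of neighbourhoods.
assault_counts = [12, 30, 45, 60, 150, 400, 25, 80]
ratio = dispersion_ratio(assault_counts)

# A ratio far above 1 signals overdispersion, so a plain Poisson model
# would understate the variability in the data.
print(f"variance-to-mean ratio: {ratio:.2f}")
```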
To formally compare performance, we calculated Akaike Information Criterion (AIC) values for Poisson and Negative Binomial regression models across crime types and predictors (total population and low-income percentage, for both 2016 and 2021), as shown in the bar plots. The Negative Binomial models have lower AIC values in all scenarios, confirming a better balance between goodness-of-fit and model complexity.
from IPython.display import Image, HTML, display
# Display correlation matrices
display(HTML("<h3>Correlation Matrices</h3>"))
display(HTML("<h4>2016 Socioeconomic Factors vs Crime</h4>"))
display(Image(filename='output/correlational_matrix/socio_crime_correlation_2016.png', width=800))
display(HTML("<h4>2021 Socioeconomic Factors vs Crime</h4>"))
display(Image(filename='output/correlational_matrix/socio_crime_correlation_2021.png', width=800))
# display(HTML("<h4>Change in Correlations (2021 vs 2016)</h4>")) # Optional: Change matrix
# display(Image(filename='output/correlational_matrix/socio_crime_correlation_change.png', width=800))
# Display additional EDA plots if relevant (e.g., specific crime comparisons)
display(HTML("<h4>Assault Counts in Improvement Areas vs Non-Improvement Areas</h4>"))
display(Image(filename='output/box_plot/improvement_area_assault_comparison.png', width=600))
Correlation Matrices
2016 Socioeconomic Factors vs Crime
2021 Socioeconomic Factors vs Crime
5.4 Implementation of Statistical Techniques¶
Linear regression models were implemented using the LinearRegression class from Python's scikit-learn library within the lin_reg.py script.
Model Structure: For each crime type (Assault, Auto Theft, etc.) and for both crime counts and crime rates, separate multivariate linear regression models were estimated for 2016 data and 2021 data.
- Dependent Variables (Targets): Crime Count (e.g., ASSAULT_2016, ASSAULT_2021) or Crime Rate (e.g., ASSAULT_RATE_2016, ASSAULT_RATE_2021, assuming rates were calculated and included).
- Independent Variables (Predictors): A consistent set of demographic and socioeconomic variables was used across models:
  - 'Total_Population'
  - 'Youth_Ratio'
  - 'Working_Age_Ratio'
  - 'Senior_Ratio'
  - 'Average_Age'
  - 'Median_Age'
  - Income metrics available for the respective year (e.g., 'Median_Income_2020', 'Average_Income_2020', 'Median_After_Tax_Income_2020', 'Average_After_Tax_Income_2020' used for 2021 models; corresponding 2016 income metrics for 2016 models). Note: The text specifically lists 2020 income metrics, likely from the 2021 census data.
Implementation Process:
- Relevant predictor and target variables were selected for each crime type/metric combination.
- A LinearRegression model was created and fitted (model.fit(X, y)).
- Model performance was assessed mainly with the R² score (coefficient of determination), which shows the proportion of variation in the dependent variable that the predictors explain.
- Model coefficients were extracted to understand the direction and size of relationships.
- Plots were made to help interpretation and check assumptions (such as scatter plots of key relationships and actual vs predicted plots to check fit).
- Results (R² scores, coefficients) were compiled, potentially into summary files (like regression_summary_2016.csv) and comparison plots (like r_squared_comparison_2016.png, crime_counts_regression.png, crime_rates_regression.png).
This made it possible to compare model performance across crime types, between counts and rates, and between the two years (2016 and 2021).
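The fitting loop described in the implementation process might look like the following sketch. It uses scikit-learn's LinearRegression as the report states, but the synthetic data, the reduced predictor list, and the `results` and `summary` structures are assumptions rather than the contents of lin_reg.py.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a merged neighbourhood dataset (values are simulated).
rng = np.random.default_rng(0)
n = 120
data = pd.DataFrame({
    "Total_Population": rng.integers(6000, 85000, size=n).astype(float),
    "Median_Age": rng.uniform(30, 50, size=n),
    "Median_Income_2020": rng.uniform(25000, 70000, size=n),
})
data["ASSAULT_2021"] = (
    0.005 * data["Total_Population"]
    - 0.001 * data["Median_Income_2020"]
    + 100
    + rng.normal(0, 30, size=n)
)

predictors = ["Total_Population", "Median_Age", "Median_Income_2020"]
targets = ["ASSAULT_2021"]  # in practice, one entry per crime type and metric

results = {}
for target in targets:
    X, y = data[predictors], data[target]
    model = LinearRegression().fit(X, y)
    results[target] = {
        "r2": model.score(X, y),  # in-sample coefficient of determination
        **dict(zip(predictors, model.coef_)),
        "intercept": model.intercept_,
    }

# One row per target, ready to be written to a summary CSV.
summary = pd.DataFrame.from_dict(results, orient="index")
print(summary.round(4))
```

The R² here is computed on the same data used for fitting, so it describes goodness of fit rather than out-of-sample predictive performance.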
We also found that linear regression alone was insufficient. For example, to examine carefully how ASSAULT counts relate to demographic predictors such as total population and low-income percentage, we adopted a three-stage statistical modeling process suited to count data. Our first, baseline model used linear regression to create a reference point, but this method soon proved inadequate. As seen in a plot comparing observed data with model predictions, the distribution of ASSAULT counts is strongly right-skewed, which violates key assumptions of linear regression: that the residuals are normally distributed and homoscedastic (i.e., have constant variance). This mismatch showed that linear regression would not give accurate or interpretable estimates, particularly for neighborhoods with higher counts of assault incidents.
As we were dealing with discrete, non-negative count data, the next step was to switch to Poisson regression. Poisson regression is commonly used for modeling count-based outcomes and is widely applied in epidemiology and insurance, among many other disciplines. Its basic assumption is that the mean and variance of the response variable are equal (equidispersion), so under ideal conditions it fits count data well. However, a more careful examination of our dataset revealed that the variance of the ASSAULT counts significantly exceeded their mean, a typical sign of overdispersion. This was confirmed statistically by computing the dispersion ratio and demonstrated visually by a QQ plot of Pearson residuals for the Poisson model, in which strong deviation from the reference line is seen at the upper quantiles. These quantiles correspond to neighborhoods with high assault counts, where the Poisson model underestimates the variance, resulting in poor predictive power and inflated residuals.
We used Negative Binomial regression to address this overdispersion and obtain a more robust model. This model extends Poisson regression by adding a dispersion parameter that allows the variance to exceed the mean, making it much better suited to real-world data where variability is not constant across the range of observations. The QQ plot of Pearson residuals for the Negative Binomial model shows better alignment with the theoretical quantiles, especially in the upper tail of the distribution, and indicates a considerably better overall fit than the Poisson model.
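For reference, the mean-variance relationships behind this comparison can be written as follows. The Negative Binomial line assumes the common NB2 parameterization with dispersion parameter α; the report does not state which parameterization was used.

```latex
\begin{aligned}
\text{Poisson:}\quad & \operatorname{Var}(Y_i) = \mu_i \\
\text{Negative Binomial (NB2):}\quad & \operatorname{Var}(Y_i) = \mu_i + \alpha \mu_i^{2}, \qquad \alpha \ge 0 \\
\text{Pearson residual:}\quad & r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\widehat{\operatorname{Var}}(Y_i)}}
\end{aligned}
```

When α = 0 the Negative Binomial variance reduces to the Poisson case. For α > 0 the extra term grows with the square of the mean, which is why the improvement in the QQ plot is concentrated in the upper tail, where counts are largest.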
5.5 Evaluation of Model Assumptions and Limitations¶
Linear regression depends on some key assumptions.
* Linearity: This means it is assumed there is a direct additive relationship between the predictors and the outcome variables. EDA suggested that this was more plausible for crime counts (which often scale with population-related predictors) than for crime rates. A visual inspection of scatter plots and residual plots (if generated) would be needed for a formal check. Where linearity was weak (especially for rates), the model might underfit.
* Independence: Assumes observations are independent. This might be violated if spatial autocorrelation exists (nearby neighbourhoods influencing each other), which was not explicitly modeled.
* Homoscedasticity: Assumes constant variance of errors. Checked visually using residual plots.
* Normality of Errors: Assumes errors are normally distributed. Checked using histograms or Q-Q plots of residuals.
Poisson regression depends on the following assumptions.
* Linearity (on the log scale): Assumes a log-linear relationship between the predictors and the expected value of the count outcome.
* Independence: Observations are assumed to be independent. This assumption may be violated in spatial or temporal data (e.g., crime counts in adjacent neighborhoods), especially if autocorrelation is present.
* Equidispersion: Poisson regression assumes that the variance of the outcome equals the mean.
* Count Outcome: The response variable must be a count (0 or positive integers). Not suitable for continuous or negative values.
Negative Binomial regression depends on the following assumptions.
* Linearity (on the log scale): Assumes a log-linear relationship between the predictors and the expected value of the count outcome.
* Independence: Assumes observations are independent. Violations due to spatial or temporal dependencies should be accounted for with appropriate models.
* Overdispersion: Negative binomial regression explicitly accounts for overdispersion in the data using an additional dispersion parameter. It relaxes the strict Poisson assumption of equal mean and variance.
* Count Outcome: The response variable must be a count (0 or positive integers). Not suitable for continuous or negative values.
Linear Model Fit (R²):
As noted in the report text and visualized in plots like r_squared_comparison_2016.png (and presumably a similar one for 2021):
* R² scores were generally higher for models predicting crime counts compared to crime rates. For instance, the 2016 Assault count model reportedly achieved R² ≈ 0.50, while the rate model was ≈ 0.31. Break and Enter showed an even larger gap (count R² > 0.42 vs. rate R² < 0.07).
* This suggests that the included demographic and income variables (especially population) explain a considerable portion of the volume of crime but are less effective at explaining the risk of crime per capita (rates). Crime rates likely depend more on other localized factors not captured in this model (e.g., specific land use, policing strategies, social cohesion, factors not included in census profiles).
Limitations:
* Model Specification: The chosen linear model might be too simple if true relationships are non-linear or involve complex interactions.
* Omitted Variable Bias: Important factors influencing both income and crime might be missing from the model (e.g., detailed land use, specific policing activities, social program availability, educational attainment levels beyond age ratios), leading to potentially biased coefficient estimates.
* Ecological Fallacy: Findings at the neighbourhood level do not necessarily apply to individuals within those neighbourhoods.
* Data Limitations: As mentioned in Section 4.4 (boundary changes, potential COVID-19 impact on 2021 data, reporting variations, measurement error in income/crime data).
* Temporal Granularity: Uses census data (every 5 years) and potentially aggregated crime data, missing finer temporal dynamics.
from IPython.display import Image, HTML, display
# Display R-squared heatmap (from linear regression fitness)
display(HTML("<h3>Model Fitness Overview (R-squared Heatmap - Linear Models)</h3>"))
display(Image(filename='output/fitness/r_squared_heatmap.png', width=700))
# Display general diagnostic plots for linear models
display(HTML("<h3>Example Diagnostic Plots (Linear Models)</h3>"))
display(Image(filename='output/model_check/diagnostic_plots.png', width=800)) # General check
Model Fitness Overview (R-squared Heatmap - Linear Models)
Example Diagnostic Plots (Linear Models)
6.1 Relationship Between Low Income and Crime Types¶
The scatter plots provide a visual representation of the relationship between various crime types and socioeconomic factors across Toronto neighborhoods. Analyzing these relationships reveals several notable patterns:
Low Income vs. Crime Type (2016 and 2021)¶
from IPython.display import Image, HTML, display
display(HTML("<h3>Low Income vs Crime Type Scatter Plots</h3>"))
display(Image(filename='output/linear_regression/low_income_vs_crime_2016.png', width=900))
display(Image(filename='output/linear_regression/low_income_vs_crime_2021.png', width=900))
Low Income vs Crime Type Scatter Plots
When examining the relationship between low income percentage and crime incidents, we observe varying strengths of association across different crime categories:
Assault¶
Assault shows the strongest relationship with low income percentage, with R² = 0.250 in 2016 increasing to 0.292 in 2021. The fitted regression line changed from y = 5.26x + 7.37 in 2016 to y = 11.88x + 32.88 in 2021; the steeper slope and higher R² suggest that the relationship between low income and assault strengthened over this five-year period.
Robbery¶
Robbery shows a moderate correlation with low income percentage (R² = 0.121 in 2016, increasing to 0.214 in 2021). The regression slope increased from 0.89 in 2016 to 1.44 in 2021, suggesting a growing association between robbery incidents and areas with higher percentages of low-income households.
Break and Enter¶
Break and enter incidents show a weak relationship with low income in 2016 (R² = 0.009), which increased somewhat by 2021 (R² = 0.096). This suggests that while there is some association, other factors likely play more significant roles in explaining break and enter patterns across neighborhoods.
Theft Types¶
Different categories of theft show varying relationships with low income:
- Theft from motor vehicles ("THEFTFROMMV") shows minimal correlation with low income (R² = 0.061 in 2016, decreasing to 0.021 in 2021)
- Bike theft shows very weak correlation (R² = 0.003 in 2016, increasing to 0.131 in 2021)
- Auto theft shows practically no correlation with low income percentage (R² = 0.001 in 2016, R² = 0.008 in 2021), suggesting these crimes may be more opportunistic or tied to factors other than neighborhood income levels
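The per-crime regression lines and R² values reported above come from one-predictor fits. A minimal sketch of such a fit is shown here; the `simple_fit` helper and the data points are illustrative assumptions, not values from the project.

```python
import numpy as np


def simple_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus its R² score."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up low-income percentages and assault counts for six neighbourhoods.
low_income_pct = [5, 10, 15, 20, 25, 30]
assault_counts = [40, 95, 70, 160, 130, 210]

slope, intercept, r2 = simple_fit(low_income_pct, assault_counts)
print(f"y = {slope:.2f}x + {intercept:.2f}, R² = {r2:.3f}")
```

With a single predictor and an intercept, R² equals the squared Pearson correlation. An R² of 0.25 therefore corresponds to a correlation of about 0.5 in absolute value.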
Population vs. Crime (2016 and 2021)¶
from IPython.display import Image, HTML, display
display(HTML("<h3>Population vs Crime Type Scatter Plots</h3>"))
display(Image(filename='output/linear_regression/population_vs_crime_2016.png', width=900))
display(Image(filename='output/linear_regression/population_vs_crime_2021.png', width=900))
Population vs Crime Type Scatter Plots
The data also reveals relationships between population size and crime incidents:
- Most crime types show moderate positive correlations with population size, which is expected as more populous areas tend to have higher absolute numbers of incidents
- Break and enter shows the strongest relationship with population in 2016 (R² = 0.380)
- Assault shows stronger correlation with population in 2016 (R² = 0.354) than in 2021 (R² = 0.244)
- Theft-related crimes generally show moderate correlation with population, with R² values mostly between 0.150 and 0.324
Temporal Changes (2016 vs. 2021)¶
Several notable changes occurred between 2016 and 2021:
- The relationship between low income and assault strengthened (R² increased from 0.250 to 0.292)
- The relationship between low income and robbery nearly doubled in strength (R² increased from 0.121 to 0.214)
- Break and enter correlation with low income increased substantially (R² from 0.009 to 0.096)
- Bike theft showed a marked increase in correlation with low income (R² from 0.003 to 0.131)
These findings suggest that socioeconomic factors, particularly the percentage of low-income households in a neighborhood, have a varying but generally strengthening relationship with different crime types over the studied period. The strongest relationships are observed with person-directed crimes like assault and robbery, while property crimes show more variable associations with low income levels.
6.2 Regression Model Performance Interpretation¶
from IPython.display import Image, HTML, display
# R Heat Map
display(HTML("<h3>Poisson and NB Model Heat Maps</h3>"))
display(Image(filename='output/count_model_fitness/nb_r_heatmap.png', width=900))
display(Image(filename='output/count_model_fitness/poisson_r_heatmap.png', width=900))
# R^2 Heat Map
display(Image(filename='output/count_models/nb_pseudo_r2_heatmap.png', width=900))
display(Image(filename='output/count_models/poisson_pseudo_r2_heatmap.png', width=900))
Figure: Poisson and NB model heat maps (correlation R and Pseudo R² for each crime type and predictor)
6.2.1 Overview of Model Performance¶
The visualizations present correlation coefficients (R) and Pseudo R² values for both Negative Binomial and Poisson regression models across four crime types (Assault, Auto Theft, Breaking and Entering, and Robbery) and four predictors (low income percentage and total population for years 2016 and 2021).
6.2.2 Correlation Analysis (R Values)¶
Both the Negative Binomial and Poisson models show remarkably similar correlation patterns:
Population vs. Crime: Total population (both 2016 and 2021) demonstrates the strongest positive correlations with all crime types, with coefficients typically ranging from 0.45 to 0.62. This suggests population size is the strongest predictor of crime counts across all categories.
Low Income vs. Crime: Low income percentage shows positive but more moderate correlations:
- Strongest for Assault (0.39-0.51)
- Moderate for Robbery (0.29-0.46)
- Weaker for Breaking and Entering (0.08-0.30)
- Negligible for Auto Theft (0.03-0.08)
Temporal Changes: Correlations are generally stronger for 2021 low income data compared to 2016 data, suggesting increased predictive power of income factors over time.
6.2.3 Model Fit Analysis (Pseudo R²)¶
The Pseudo R² values indicate how much each model improves on an intercept-only baseline; we read them here as the approximate share of variation explained by each predictor:
Population-Based Models: Total population consistently explains 26-42% of the variance across crime types, with:
- Highest explanatory power for Breaking and Entering (41-42%)
- Strong explanatory power for Auto Theft (34-41%)
- Moderate explanatory power for Assault (31-40%)
- Slightly lower explanatory power for Robbery (22-38%)
Income-Based Models: Low income percentage has variable explanatory power:
- Moderate for Assault (28-35%)
- Low-to-moderate for Robbery (15-26%)
- Very low for Breaking and Entering (0.9-12%)
- Essentially non-existent for Auto Theft (0.4-1.8%)
Model Comparison: The Negative Binomial and Poisson models produced very similar Pseudo R² values, suggesting both are capturing similar patterns in the data.
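To make the Pseudo R² figures concrete, the following sketch fits a one-predictor Poisson model with a log link and computes McFadden's pseudo R², defined as 1 - LL_model / LL_null. It is an illustration only: McFadden's definition is an assumption (the heat maps may use a different variant), the Newton-Raphson fitter is hand-rolled for self-containment rather than taken from our repository, and the data are hypothetical.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given fitted means mu."""
    return sum(yi * math.log(m) - m - math.lgamma(yi + 1) for yi, m in zip(y, mu))


def fit_poisson(x, y, iterations=50):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient (score) of the log-likelihood
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for xi, yi, m in zip(x, y, mu))
        # Fisher information, a symmetric 2x2 matrix
        h00 = sum(mu)
        h01 = sum(m * xi for xi, m in zip(x, mu))
        h11 = sum(m * xi * xi for xi, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) @ gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def mcfadden_pseudo_r2(x, y):
    """McFadden's pseudo R²: 1 - LL(fitted model) / LL(intercept-only model)."""
    b0, b1 = fit_poisson(x, y)
    mu_model = [math.exp(b0 + b1 * xi) for xi in x]
    mu_null = [sum(y) / len(y)] * len(y)  # intercept-only model predicts the mean
    return 1 - poisson_loglik(y, mu_model) / poisson_loglik(y, mu_null)


# Hypothetical data: population (in thousands) and assault counts
population_thousands = [10, 15, 20, 25, 30, 35]
assault_counts = [12, 18, 30, 41, 60, 95]

print(f"McFadden pseudo R² = {mcfadden_pseudo_r2(population_thousands, assault_counts):.3f}")
```

Because the measure compares log-likelihoods rather than sums of squares, it is not directly comparable to the R² of a linear regression.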
6.2.4 Crime-Specific Insights¶
- Assault: Shows consistent moderate-to-strong relationships with both population and income variables. Total population in 2016 explains approximately 39-40% of variance while low income in 2021 explains about 33-35%.
- Auto Theft: Strongly associated with population (35-41% variance explained) but shows almost no relationship with income factors (< 2% variance explained), suggesting socioeconomic factors play a minimal role in predicting auto theft.
- Breaking and Entering: Demonstrates the strongest association with population (42% variance explained) among all crime types but weak association with income. This suggests opportunity factors may be more important than socioeconomic drivers.
- Robbery: Shows more balanced associations with both population and income compared to other crime types, though population remains the stronger predictor.
6.3 Answering the Research Question¶
What is the relationship between household income levels and crime rates across Toronto neighborhoods, and how do these trends vary by type of crime?
Based on our analysis:
- There seems to be a positive relationship between indicators of lower socioeconomic status, such as higher low-income percentage, and higher neighbourhood crime counts. Higher average/median income seems to be associated with lower crime counts, but the strength and significance vary.
- The relationship is significantly weaker when looking at crime rates per capita, which suggests that richer neighbourhoods might have fewer crimes overall due to lower density or other factors, but the risk per person isn't solely determined by the neighbourhood's average income level. Other factors play a considerable role in determining rates.
- The strength and significance of the income-crime correlation vary by crime type. In our models, assault and robbery show a clearer link to low income than breaking and entering or auto theft.
- There are indications that these relationships might have shifted between 2016 and 2021, with the predictive power of the model changing for different crime types and metrics (counts vs. rates).
Overall Conclusion from Results: Our analysis and findings suggest that while income levels are correlated with crime patterns in Toronto neighbourhoods (especially crime volume), it is not the sole contributing factor, particularly for per capita crime risk. The relationship is deeply complex, varies by crime type and potentially over time, and interacts heavily with population density and other demographic factors. The included variables provide only a partial explanation, highlighting the need to consider other unmeasured factors.
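The distinction between crime counts and per capita rates drawn above can be illustrated with a short calculation. This is a minimal sketch with hypothetical neighbourhood names and figures, and the per-100,000 scaling is a common convention assumed here rather than necessarily the one used in our notebooks.

```python
def crime_rate_per_100k(count, population, per=100_000):
    """Convert a raw incident count into a rate per `per` residents."""
    return count * per / population


# Hypothetical neighbourhoods: (name, assault count, population)
neighbourhoods = [
    ("Dense downtown area", 300, 30_000),
    ("Small residential area", 120, 8_000),
]

for name, count, population in neighbourhoods:
    rate = crime_rate_per_100k(count, population)
    print(f"{name}: {count} incidents, {rate:.0f} per 100,000 residents")
```

In this example the first area has more incidents in total but a lower rate per resident, which is why count-based and rate-based results can point in different directions.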
7. Discussion¶
7.1 Main Findings¶
Our analysis reveals critical insights about crime patterns in Toronto neighborhoods and the appropriate analytical methods:
First, crime data exhibits extreme overdispersion, with variance-to-mean ratios of 14-82 across crime types. This statistical property invalidates standard linear regression and necessitates count models, specifically Negative Binomial regression.
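As a rough illustration of the overdispersion check, the sketch below computes a variance-to-mean ratio for a small set of hypothetical counts (not our data). A Poisson model assumes this ratio is about 1, so values far above 1 point towards a Negative Binomial model.

```python
import statistics


def variance_to_mean_ratio(counts):
    """Dispersion index: about 1 for Poisson-like data, well above 1 when overdispersed."""
    return statistics.variance(counts) / statistics.mean(counts)


# Hypothetical assault counts for a handful of neighbourhoods
assault_counts = [2, 3, 5, 40]
ratio = variance_to_mean_ratio(assault_counts)

print(f"Variance-to-mean ratio: {ratio:.1f}")
```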
Second, we found significant socioeconomic patterns: population size strongly predicts all crime types (Pseudo R² 0.24-0.42), while low-income percentage strongly predicts violent crimes but less so property crimes. Assault shows the strongest relationship with low income (R² ~0.34), with each percentage point increase associated with approximately 10% more assaults.
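The "approximately 10% more assaults" reading comes from the log link used by count models: a coefficient β multiplies the expected count by exp(β) for each one-unit increase in the predictor. The sketch below shows the conversion. The coefficient is hypothetical, back-computed from the roughly 10% figure above rather than copied from our fitted model.

```python
import math


def percent_change(beta, delta=1.0):
    """Percent change in the expected count for a `delta`-unit rise in the predictor."""
    return (math.exp(beta * delta) - 1) * 100


# Hypothetical coefficient, back-computed from the ~10% figure above
beta_low_income = math.log(1.10)

print(f"+1 point:  {percent_change(beta_low_income):.1f}% more assaults")
print(f"+5 points: {percent_change(beta_low_income, 5):.1f}% more assaults")
```

Effects compound multiplicatively, so under this assumed coefficient a five-point increase corresponds to roughly 61% more assaults rather than 50%.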
Third, our methodological comparison demonstrates why appropriate statistical approaches matter: linear models generated impossible negative predictions and systematically underestimated high-crime areas due to violated assumptions.
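The negative-prediction problem can be reproduced on a toy example. The sketch below fits an ordinary least-squares line to hypothetical, heavily skewed counts and compares it with a log-link prediction. The data and the log-link coefficients are illustrative assumptions, not output from our models.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical data: low-income percentage and skewed assault counts
low_income_pct = [5, 10, 15, 20, 25]
assault_counts = [0, 0, 1, 3, 20]

intercept, slope = fit_line(low_income_pct, assault_counts)


def predict_linear(x):
    """Linear prediction; nothing prevents it from falling below zero."""
    return intercept + slope * x


def predict_log_link(x, b0=-3.0, b1=0.24):
    """Count-model style prediction; b0 and b1 are hypothetical, not fitted."""
    return math.exp(b0 + b1 * x)


for pct in (5, 25):
    print(f"{pct}% low income: linear {predict_linear(pct):.1f}, "
          f"log link {predict_log_link(pct):.1f}")
```

The fitted line predicts a negative count at the low end and falls short of the observed 20 incidents at the high end, while an exponentiated prediction is positive for any coefficient values.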
7.2 Limitations¶
The study's findings should be considered in light of several limitations:
Linear Model Constraints: Linear regression assumes linear relationships and may not capture complex, non-linear dynamics or threshold effects between socioeconomic factors and crime. The relatively low R² values for crime rates suggest that linear models with the current predictors are insufficient to fully explain the relationship between crime and neighbourhood income levels.
Omitted Variables: The models likely suffer from omitted variable bias. Factors not included but potentially correlated with both income and crime (e.g., detailed land use patterns, proximity to transit hubs, specific policing strategies, community-level social capital, educational attainment details, unemployment rates not captured by census income data alone) could significantly influence the observed relationships.
Crime Data Issues:
Underreporting: Official crime statistics only capture reported incidents. Actual crime levels may be higher, and reporting rates might vary systematically across neighbourhoods (e.g., based on trust in police, types of crime), potentially biasing the data.
Definition Changes: Changes in crime definitions or recording practices over time could affect comparability.
Temporal Factors: The comparison between 2016 and 2021 is potentially confounded by unique events, most notably the COVID-19 pandemic, which significantly altered social patterns and likely crime trends in 2021. As a result, the changes we observe in the income-crime relationship cannot be cleanly separated from pandemic effects without further analysis of data from other years.
Spatial Dependencies: The analysis does not account for spatial autocorrelation – the possibility that crime levels in one neighbourhood are influenced by those in adjacent neighbourhoods. In the real world, it is most likely that neighbourhoods are affected by each other.
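One common way to test for this kind of dependence is Moran's I, which was not computed in this study. The sketch below is a minimal illustration with binary adjacency weights. The four-neighbourhood layout and the crime counts are hypothetical.

```python
def morans_i(values, neighbours):
    """Moran's I with binary weights; `neighbours` maps each index to adjacent indices."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of cross-products over every ordered pair of adjacent areas
    cross = sum(dev[i] * dev[j] for i, adj in neighbours.items() for j in adj)
    total_weight = sum(len(adj) for adj in neighbours.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Hypothetical layout: four neighbourhoods in a row (0-1-2-3)
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = [10, 20, 30, 40]    # similar values sit next to each other
alternating = [10, 40, 10, 40]  # high and low values alternate

print(f"Clustered:   {morans_i(clustered, adjacency):.2f}")
print(f"Alternating: {morans_i(alternating, adjacency):.2f}")
```

Positive values indicate that neighbouring areas tend to have similar crime levels, and negative values indicate that they tend to differ.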
7.3 Robustness and Future Directions¶
While the linear regression provides a baseline understanding, the limitations highlight areas for future research to improve robustness:
Advanced Modeling: Explore
Richer Data: Incorporate additional variables, such as:
- Detailed land use data (residential, commercial, industrial mix).
- Access to public resources (transit, parks, community centres).
- Measures of inequality within neighbourhoods (not just average income).
- More direct measures of unemployment or economic precarity.
- Data on policing deployment or community initiatives.
- Potentially, survey data on unreported crime or fear of crime.
Temporal Analysis: Utilize more granular time-series data if available to better understand short-term dynamics and causality.
Qualitative Research: Complement quantitative analysis with qualitative studies (e.g., interviews with residents, community leaders, law enforcement in high- and low-crime areas) to understand the lived experiences and mechanisms behind the observed statistical patterns.
Conclusion: These findings align with criminological theory suggesting different causal mechanisms for violent versus property crimes, while emphasizing that social context significantly influences crime patterns. The robust relationship between socioeconomic factors and crime highlights potential avenues for intervention beyond traditional policing strategies. Other potentially important variables remain to be uncovered; identifying them could help the government take further steps to resolve or ease the problem.
8. References¶
- Lau, M. (2024). Numbers don’t lie—crime up significantly in Toronto and across Canada. Fraser Institute. https://www.fraserinstitute.org/commentary/numbers-dont-lie-crime-up-significantly-in-toronto-and-across-canada
- Better Dwelling. (n.d.). Stat Can: Toronto Takes The Top Spot For The Highest Ratio Of Poverty In Canada. https://betterdwelling.com/stat-can-toronto-takes-the-top-spot-for-the-highest-ratio-of-poverty-in-canada/
- Government of Canada, Department of Justice. (2022). Exploring the link between crime and socio-economic status in Ottawa and Saskatoon: A small-area geographical analysis. https://www.justice.gc.ca/eng/rp-pr/csj-sjc/crime/rr06_6/p2.html
- Mohammadi, A., Bergquist, R., Fathi, G., Pishgar, E., de Melo, S. N., Sharifi, A., & Kiani, B. (2022). Homicide rates are spatially associated with built environment and socio-economic factors: A study in the neighbourhoods of Toronto, Canada. BMC Public Health, 22, 1482. https://doi.org/10.1186/s12889-022-13807-4
- Social Planning Toronto. (n.d.). Toronto's critical situation by the numbers. Social Planning Toronto. https://www.socialplanningtoronto.org/poverty_by_the_numbers
- Westin, M. (2021). Paper One. Telling Stories with Data. https://tellingstorieswithdata.com/inputs/pdfs/paper_one-2021-Morgaine_Westin.pdf
Additional references from web search results providing context on methodology structure:
- CollegeEssay.org Blog. (Jan 7, 2025). How To Write The Methods Section of a Research Paper Step-by-Step. https://collegeessay.org/blog/how-to-write-a-research-paper/research-paper-methods-section
- Sacred Heart University Library. (n.d.). Organizing Academic Research Papers: 6. The Methodology. https://library.sacredheart.edu/c.php?g=29803&p=185928
9. Appendix¶
Please see all data, Python code, and outputs in the following GitHub repository: https://github.com/shirleychen003/inf412final